Conversation
divyashreepathihalli
left a comment
This is great @VarunS1997!! Left a few comments. Also, please add a Colab to verify the output.
One additional piece of overhead work is needed: please add keras_cv/models/feature_extractor/coca to this file. PS: we will fix this overhead soon, but in the meantime this is what we need to do to make sure the large GPU tests run.
…each layer with expected sizing
mattdangerw
left a comment
Thanks! Left a few comments.
import numpy as np
from keras import Sequential
from keras_cv.api_export import keras_cv_export
from keras_nlp.layers import RotaryEmbedding, TransformerDecoder
I think we were doing a conditional import of keras_nlp, so keras_cv stayed installable without keras-nlp if you are only using unrelated features. But that was when the only use was for one tokenizer.
keras-cv/keras_cv/models/feature_extractor/clip/clip_model.py
Lines 81 to 85 in 5faae37
We could reconsider if keras-cv should hard depend on keras-nlp if we want more stuff like this? No strong feelings. @divyashreepathihalli fyi
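For reference, a minimal sketch of the conditional-import pattern described above, assuming a lazy check at usage time; the helper name and error message are illustrative, not the actual keras_cv code:

```python
# Sketch only: guard the import so keras_cv stays installable without
# keras-nlp, and fail lazily when a keras-nlp-backed feature is used.
try:
    import keras_nlp
except ImportError:
    keras_nlp = None


def assert_keras_nlp_installed():  # hypothetical helper name
    if keras_nlp is None:
        raise ImportError(
            "This feature requires the `keras-nlp` package. "
            "Please install it with `pip install keras-nlp`."
        )
```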
I think as we support more multi modal models we should depend on KerasNLP. If the tf-text install issue is resolved, we should add it.
Sgtm! Though if we switch to a hard dependency here, we should probably add keras-nlp as a dependency in setup.py (which comes with a transitive dependency on tensorflow-text and tensorflow just fyi).
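If the hard dependency route were taken, the setup.py change would presumably look something like the sketch below; the surrounding fields and omitted requirements are assumptions, not the actual keras-cv setup.py:

```python
# Illustrative setup.py excerpt only; unrelated requirements omitted.
from setuptools import find_packages, setup

setup(
    name="keras-cv",
    packages=find_packages(),
    install_requires=[
        # New hard dependency under discussion; note it transitively
        # pulls in tensorflow-text and tensorflow.
        "keras-nlp",
    ],
)
```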
Do we want to include that in this PR? There are already some imports of Keras-NLP in other places in Keras CV.
If we make it a separate PR, it'll make it easier to rollback if we need to. Considering it's a new dependency, might be worth separating.
# [batch_size, sequence_length+1, text_dim]
text_tokens = np.concatenate(texts, self.cls_token)
mask = np.concatenate((np.ones_like(texts), np.zeros_like(self.cls_token)))
I can't tell if this mask is the right shape or not. Usually you want something like (batch_size, seq_length, seq_length) (or seq_length + 1, if that is the effective sequence length). What is text_dim here?
text_dim is the dimensionality of the text embeddings
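For concreteness, a small shape sketch of the two mask layouts being discussed; the values are illustrative, and it mirrors the ones/zeros choice from the snippet above rather than prescribing it:

```python
import numpy as np

batch_size, seq_len, text_dim = 2, 8, 64
texts = np.zeros((batch_size, seq_len, text_dim))
cls_token = np.zeros((batch_size, 1, text_dim))

# Concatenate along the sequence axis: (batch_size, seq_len + 1, text_dim)
text_tokens = np.concatenate((texts, cls_token), axis=1)

# A per-token padding mask is (batch_size, seq_len + 1) ...
token_mask = np.concatenate(
    (np.ones((batch_size, seq_len)), np.zeros((batch_size, 1))), axis=1
)

# ... while an attention mask is usually (batch_size, seq_len + 1, seq_len + 1).
attn_mask = token_mask[:, None, :] * token_mask[:, :, None]
print(text_tokens.shape, token_mask.shape, attn_mask.shape)
# (2, 9, 64) (2, 9) (2, 9, 9)
```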
divyashreepathihalli
left a comment
Thanks! I left a few comments. Also, please do add a Colab demo to show that the model is working as expected.
keras_cv/layers/attention_pooling.py
Outdated
super().build(input_shape)

self.multi_head_attn.build(input_shape)
self.layer_norm.build(input_shape)
is the input shape for layer_norm correct?
Fixed it, let me know if it still doesn't look right!
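As a hedged illustration of the concern, here is a minimal attention-pooling sketch assuming a single learned query token: the LayerNormalization is built on the shape of the attention output (query length 1) rather than on the raw key/value input_shape. Class and variable names are illustrative, not the PR's code.

```python
import keras
from keras import layers


class AttentionPoolingSketch(layers.Layer):
    """Illustrative only: pools (batch, seq_len, width) down to one learned query token."""

    def __init__(self, num_heads=8, **kwargs):
        super().__init__(**kwargs)
        self.num_heads = num_heads

    def build(self, input_shape):
        width = input_shape[-1]
        self.query = self.add_weight(
            name="query", shape=(1, 1, width), initializer="random_normal"
        )
        self.multi_head_attn = layers.MultiHeadAttention(
            num_heads=self.num_heads, key_dim=width
        )
        self.layer_norm = layers.LayerNormalization()
        # The norm sees the attention output, whose sequence length is the
        # number of query tokens (1 here), not the input sequence length.
        self.layer_norm.build((input_shape[0], 1, width))
        super().build(input_shape)

    def call(self, x):
        # Broadcast the learned query token across the batch dimension.
        query = keras.ops.ones_like(x[:, :1, :]) * self.query
        pooled = self.multi_head_attn(query=query, value=x, key=x)
        return self.layer_norm(pooled)
```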
self.image_patching = PatchingAndEmbedding(
    self.encoder_width, self.img_patch_size
)
self.image_encoder = Sequential(
Sequential might not work; the model will not build properly.
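One possible workaround, as a hedged sketch: keep the blocks in a plain Python list and build/call them explicitly instead of wrapping them in Sequential, so the parent layer controls the shapes. The Dense layers here are just stand-ins for the encoder blocks:

```python
from keras import layers


class ExplicitStack(layers.Layer):
    """Sketch: sub-layers held in a list, built and called in a loop."""

    def __init__(self, width=64, depth=2, **kwargs):
        super().__init__(**kwargs)
        self.blocks = [layers.Dense(width, activation="relu") for _ in range(depth)]

    def build(self, input_shape):
        shape = input_shape
        for block in self.blocks:
            block.build(shape)
            shape = block.compute_output_shape(shape)
        super().build(input_shape)

    def call(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```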
)

self.text_embedding = RotaryEmbedding()
self.unimodal_text_decoder = Sequential(
Again, Sequential might not work well. Please double-check.
    for _ in range(self.unimodal_decoder_depth)
    ]
)
self.multimodal_text_decoder = Sequential(
same comment about Sequential
num_patches = (images_shape[1] // self.img_patch_size) * (
    images_shape[2] // self.img_patch_size
) + 1
self.image_encoder.build((batch_size, self.encoder_width, num_patches))
You could keep batch_size as None, for example:
self.image_encoder.build((None, self.encoder_width, num_patches))
Just for my understanding, is there a specific reason to do that over setting the batch_size?
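One reason often given, stated here as an assumption about the intent: weights created in build() do not depend on the batch dimension, so building with None keeps the layer usable for any batch size at call time. A tiny standalone illustration (Dense is a stand-in for the encoder):

```python
import numpy as np
from keras import layers

dense = layers.Dense(32)
dense.build((None, 16, 128))  # batch dimension left as None

# The built layer now accepts any batch size.
print(dense(np.zeros((2, 16, 128))).shape)  # (2, 16, 32)
print(dense(np.zeros((7, 16, 128))).shape)  # (7, 16, 32)
```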
contrastive loss from the ViT and the uni-modal Text Decoder is combined with a captioning loss from the multi-modal
Decoder in order to produce the combined total loss.

Basic Usage:
This has to be changed to Example:, since we follow only Example: or Examples: as the standard format.
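A minimal illustration of the docstring heading convention being requested; the class body and usage lines are placeholders, not the PR's actual API:

```python
class CoCa:
    """Contrastive Captioner (CoCa) feature extractor.

    Example:

    model = CoCa()          # placeholder usage, not the real signature
    outputs = model(inputs)
    """
```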
@VarunS1997 will you be completing this one?
What does this PR do?
Implements the work done in "CoCa: Contrastive Captioners are Image-Text Foundation Models" (https://arxiv.org/pdf/2205.01917.pdf).
This PR requires: